
8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation #26171


Draft: wants to merge 58 commits into master

Conversation

@pengxiaolong commented Jul 7, 2025

Shenandoah always allocates memory under the heap lock. We have observed heavy heap-lock contention on the memory allocation path; this change proposes an optimization of the mutator allocation path to reduce that contention.
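The core idea can be sketched as lock-free bump-pointer allocation within a region. This is a minimal illustration of the general CAS technique, not the actual HotSpot code; the names `Region` and `par_alloc` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>

// Sketch of lock-free bump-pointer allocation within a region, the general
// technique this PR applies to the mutator path (names are illustrative).
struct Region {
    char* start;
    char* end;
    std::atomic<char*> top;  // current allocation pointer

    Region(char* s, char* e) : start(s), end(e), top(s) {}

    // Try to carve `size` bytes out of the region with a CAS loop;
    // returns nullptr when the region cannot satisfy the request.
    char* par_alloc(std::size_t size) {
        char* old_top = top.load(std::memory_order_relaxed);
        for (;;) {
            if (size > static_cast<std::size_t>(end - old_top)) {
                return nullptr;  // not enough space left in this region
            }
            // compare_exchange_weak reloads old_top on failure, so the
            // loop retries against the freshly observed top.
            if (top.compare_exchange_weak(old_top, old_top + size,
                                          std::memory_order_relaxed)) {
                return old_top;  // this thread owns [old_top, old_top + size)
            }
        }
    }
};
```

Under contention, competing threads race on the CAS; losers simply retry with the updated `top`, so no thread ever blocks on a lock for the fast path.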


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation (Enhancement - P4)

Reviewing

Using git

Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26171/head:pull/26171
$ git checkout pull/26171

Update a local copy of the PR:
$ git checkout pull/26171
$ git pull https://git.openjdk.org/jdk.git pull/26171/head

Using Skara CLI tools

Check out this PR locally:
$ git pr checkout 26171

View PR using the GUI difftool:
$ git pr show -t 26171

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26171.diff


bridgekeeper bot commented Jul 7, 2025

👋 Welcome back xpeng! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Jul 7, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.


openjdk bot commented Jul 7, 2025

@pengxiaolong The following labels will be automatically applied to this pull request:

  • hotspot-gc
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

The openjdk bot added labels hotspot-gc (hotspot-gc-dev@openjdk.org) and shenandoah (shenandoah-dev@openjdk.org) on Jul 7, 2025.
@kdnilsen (Contributor) left a comment


Thanks for doing this work. It is huge progress. I've left a number of comments. I didn't have time to study or comment on all of the code, as I am out of office for the rest of this week. I'll look more when I return.

@@ -244,21 +244,18 @@ void ShenandoahRegionPartitions::establish_mutator_intervals(idx_t mutator_leftm
_leftmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_leftmost_empty;
_rightmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_rightmost_empty;

Contributor:

We have the heap lock here, so should not need to use atomic store operations. Atomic operations have a performance penalty that I think we want to avoid.

_capacity[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_region_count * _region_size_bytes;
_available[int(ShenandoahFreeSetPartitionId::Mutator)] =
_capacity[int(ShenandoahFreeSetPartitionId::Mutator)] - _used[int(ShenandoahFreeSetPartitionId::Mutator)];
Atomic::store(_region_counts + int(ShenandoahFreeSetPartitionId::Mutator), mutator_region_count);
Contributor:

If we do need to use Atomic operations, would prefer &_region_counts[int(ShenandoahFreeSetPartitionId::Mutator)] notation for the array elements.

}

void ShenandoahRegionPartitions::increase_used(ShenandoahFreeSetPartitionId which_partition, size_t bytes) {
shenandoah_assert_heaplocked();
Contributor:

Are you now calling this directly from the CAS allocators? So you want to not have to assert heap locked and that is why we make the used accounting atomic?

My preference would be to avoid this need by counting the entire region as used at the time the region becomes directly allocatable.

@@ -621,18 +612,31 @@ void ShenandoahRegionPartitions::assert_bounds() {
{
size_t capacity = _free_set->alloc_capacity(i);
bool is_empty = (capacity == _region_size_bytes);
assert(capacity > 0, "free regions must have allocation capacity");
// TODO remove assert, not possible to pass when allow mutator to allocate w/o lock.
Contributor:

Probably the preferred approach here is to "pre-retire" regions when they are made directly allocatable. When the region is pre-retired, it is taken out of the partition, so assert_bounds no longer applies to this region.

@@ -78,10 +79,9 @@ class ShenandoahRegionPartitions {
// are denoted in bytes. Note that some regions that had been assigned to a particular partition at rebuild time
// may have been retired following the rebuild. The tallies for these regions are still reflected in _capacity[p]
// and _used[p], even though the region may have been removed from the free set.
Contributor:

Prefer not to make these volatile, as that imposes a compiler overhead.

_available[int(which_partition)], _capacity[int(which_partition)], _used[int(which_partition)],
partition_membership_name(ssize_t(which_partition)));
return _available[int(which_partition)];
return capacity_of(which_partition) - used_by(which_partition);
}

// Return available_in assuming caller does not hold the heap lock. In production builds, available is
// returned without acquiring the lock. In debug builds, the global heap lock is acquired in order to
// enforce a consistency assert.
inline size_t available_in_not_locked(ShenandoahFreeSetPartitionId which_partition) const {
Contributor:

These changes are beyond the scope of the planned topic. I think we need to consider them more carefully and would prefer not to mix the two. (I personally believe the original implementation has better performance, but feel free to prove me wrong.)

}

template<bool IS_TLAB>
HeapWord* ShenandoahFreeSet::par_allocate_single_for_mutator(ShenandoahAllocRequest &req, bool &in_new_region) {
Contributor:

Not clear to me what prefix par_ represents. Parallel allocate (without lock?)

uint count = 0u;
for (uint i = 0u; i < max_probes; i++) {
ShenandoahHeapRegion** shared_region_address = _directly_allocatable_regions + idx;
ShenandoahHeapRegion* r = Atomic::load_acquire(shared_region_address);
Contributor:

This code has a lot more synchronization overhead than what is required for CAS allocations. load_acquire() forces a memory fence. All writes performed by other threads before the store_release() must be visible to this thread upon return from load_acquire. I would like to see some documentation that describes the coherency model that we assume/require here. Can we write this up as a comment in the header file?

Personal preference: I think there are many situations where we get better performance if we allow ourselves to see slightly old data, and we can argue that the slightly old data is "harmless". For example, if some other thread replaces the directly_allocatable_region[N] while we're attempting to allocate from directly_allocatable_region[N], we might attempt to allocate from the original region and fail. That's harmless. We'll just retry at the next probe point. If multiple probes fail to allocate, we'll take the synchronization lock and everything will be resolved there. The accumulation of atomic volatile access has a big impact on performance. I've measured this in previous experiments. You can do the same.
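The weaker-ordering alternative described above can be sketched as follows. This is an illustration of the reviewer's suggestion, with hypothetical names (`Region`, `g_shared`, `probe`), not the PR's actual code: probe a shared slot with a relaxed load rather than `load_acquire`, accepting that a stale pointer only costs a failed allocation attempt and a retry.

```cpp
#include <atomic>
#include <cstddef>

struct Region { int dummy; };  // opaque stand-in for ShenandoahHeapRegion

constexpr std::size_t kNumShared = 13;  // directly allocatable slot count
std::atomic<Region*> g_shared[kNumShared] = {};

Region* probe(std::size_t idx) {
    // memory_order_relaxed: no fence is emitted. We may observe a slot that
    // another thread is concurrently replacing; the subsequent CAS on that
    // region's top pointer simply fails and we move to the next probe.
    return g_shared[idx % kNumShared].load(std::memory_order_relaxed);
}
```

The correctness argument is that the slot array only ever holds valid region pointers (or null), so any value observed is safe to attempt an allocation against; freshness affects only the retry rate, not safety.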

return allocate_for_mutator(req, in_new_region);
}
// If any of the 3 consecutive directly allocatable regions is ready for retire and replacement,
// grab heap lock try to retire all ready-to-retire shared regions.
Contributor:

Would be preferable to allow this thread to retire all ready-to-retire regions in the directly allocatable set (not just the three that I know about) while it holds the heap lock. We do not necessarily need to keep a separate per-thread representation of ready-to-retire shared regions. This is a very rare event. Just iterate through the 13 (or whatever) regions in the directly allocatable set and ask for each whether end-top is less than min plab size.
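The suggested sweep could look roughly like this. A hypothetical sketch only: once the heap lock is held, scan every directly allocatable slot and mark regions whose remaining space is below the minimum useful allocation size. The names `Slot`, `sweep_ready_to_retire`, and the `kMinPlabSize` constant are illustrative.

```cpp
#include <cstddef>

struct Slot {
    std::size_t end;  // region end offset
    std::size_t top;  // current fill offset
    std::size_t free() const { return end - top; }  // remaining bytes
};

constexpr std::size_t kMinPlabSize = 32;  // illustrative threshold

// Marks every ready-to-retire slot; returns how many were marked.
// The caller (holding the heap lock) would retire those regions and
// install fresh ones in their slots.
std::size_t sweep_ready_to_retire(const Slot* slots, std::size_t n,
                                  bool* retire) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; i++) {
        retire[i] = slots[i].free() < kMinPlabSize;
        if (retire[i]) {
            count++;
        }
    }
    return count;
}
```

Since the sweep runs only on the rare slow path and the slot count is small, the full scan adds negligible cost while avoiding any per-thread bookkeeping of ready-to-retire regions.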

@@ -287,6 +271,28 @@ class ShenandoahRegionPartitions {
void assert_bounds() NOT_DEBUG_RETURN;
};

#define DIRECTLY_ALLOCATABLE_REGION_UNKNOWN_AFFINITY ((Thread*)-1)
Contributor:

Out of time to dive deep into this right now. Wonder if it makes sense to randomly generate a hash for each thread and store this into a thread-local field. Might provide "randomness" and locality.
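The thread-local hash idea floated above might be sketched like this. All names are hypothetical; the mixer is the SplitMix64 finalizer, chosen here only as an example of a cheap, well-distributed hash.

```cpp
#include <cstdint>
#include <cstddef>

// Cheap integer mixer (SplitMix64 finalizer).
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Picks this thread's starting slot among the directly allocatable regions.
std::size_t probe_start(std::size_t num_slots) {
    // Computed once per thread and cached: the address of the thread_local
    // itself is a cheap per-thread entropy source, mixed to spread threads
    // evenly across slots.
    thread_local const std::uint64_t hash =
        mix64(reinterpret_cast<std::uint64_t>(&hash));
    return static_cast<std::size_t>(hash % num_slots);
}
```

Caching the hash in a thread-local gives each thread a stable home slot (locality), while the mixing step randomizes which slot that is across threads (contention spreading).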
